Extracting Clusters from Large Datasets with Multiple Similarity Measures Using IMSCAND
Authors
Abstract
We consider the problem of grouping objects when multiple similarity measures are known. For example, for a group of people we may know their education, geographic location, and family connections, and want to cluster them using all three similarities simultaneously. Our approach is to store each similarity as a slice of a tensor. The similarity measures are generated by comparing features. In general, the object similarity matrix is dense; however, it can be stored implicitly as the product of a sparse object-feature matrix and its transpose. For this new type of tensor, in which dense slices are stored implicitly, we have created a new decomposition called Implicit Slice Canonical Decomposition (IMSCAND). Our decomposition is equivalent to the tensor CANDECOMP/PARAFAC decomposition, which is a higher-order analogue of the matrix singular value decomposition (SVD) and principal component analysis (PCA). From IMSCAND we obtain compilation feature vectors, which are clustered using k-means. We demonstrate the applicability of IMSCAND on a set of journal articles with multiple similarities.
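The implicit storage idea in the abstract can be illustrated with a short sketch. This is not the authors' IMSCAND implementation. It only shows, under stated assumptions, how a similarity slice of the form A_k A_k^T can be used inside a CANDECOMP/PARAFAC-style update without ever forming the dense slice. The function names, the random test data, and the choice of the mode-1 "matricized tensor times Khatri-Rao product" step are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from scipy import sparse


def implicit_slice_matmul(A, B):
    """Compute (A @ A.T) @ B without forming the dense n x n slice.

    A is a sparse object-feature matrix (n x f); B is a dense n x R matrix.
    """
    return A @ (A.T @ B)


def implicit_mttkrp_mode1(feature_mats, B, C):
    """Hypothetical helper: sum over slices k of (A_k A_k^T) B diag(C[k, :]).

    This is the mode-1 product that a CP alternating-least-squares update
    needs, evaluated slice by slice on the implicit representation.
    """
    n, R = B.shape
    M = np.zeros((n, R))
    for k, A in enumerate(feature_mats):
        # Scale column r of the slice product by C[k, r].
        M += implicit_slice_matmul(A, B) * C[k, :]
    return M


# Illustrative data: 6 objects, 3 similarity measures with 4, 5 and 3 features.
rng = np.random.default_rng(0)
n, R = 6, 2
feature_counts = [4, 5, 3]
feature_mats = [
    sparse.random(n, f, density=0.4, format="csr", random_state=i)
    for i, f in enumerate(feature_counts)
]
B = rng.standard_normal((n, R))
C = rng.standard_normal((len(feature_mats), R))

M_implicit = implicit_mttkrp_mode1(feature_mats, B, C)

# Dense reference, feasible only because this example is tiny.
X = np.stack([(A @ A.T).toarray() for A in feature_mats], axis=2)
M_dense = np.einsum("ijk,jr,kr->ir", X, B, C)
```

Here `M_implicit` and `M_dense` should agree to floating-point precision. The implicit version touches only the sparse object-feature matrices, so its cost scales with their nonzero counts rather than with the n x n size of each dense slice.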
Similar resources
A Hybrid Time Series Clustering Method Based on Fuzzy C-Means Algorithm: An Agreement Based Clustering Approach
In recent years, the advancement of information gathering technologies such as GPS and GSM networks has led to huge, complex datasets such as time series and trajectories. As a result, it is essential to use appropriate methods to analyze the large raw datasets produced. Extracting useful information from large data sets has always been one of the most important challenges in different sciences,...
An Empirical Comparison of Distance Measures for Multivariate Time Series Clustering
Multivariate time series (MTS) data are ubiquitous in science and daily life, and how to measure their similarity is a core part of the MTS analysis process. Many research efforts in this context have focused on proposing novel similarity measures for the underlying data. However, with the countless techniques to estimate similarity between MTS, this field suffers from a lack of comparative...
Efficient Clustering of High Dimensional Datasets with Multi Viewpoint Based Similarity Measure
Many important real-time applications involve clustering large datasets. A dataset can be large if it contains a large number of elements, if each element has many features, or if there are many clusters to discover. Recent advances in clustering algorithms have partially addressed these dataset issues. However, there has been much less work on methods of efficiently cluster...
A Novel K means Clustering Algorithm for Large Datasets Based on Divide and Conquer Technique
In this paper we propose an efficient algorithm based on the divide-and-conquer technique for clustering large datasets. In our research work we have applied the divide-and-conquer technique to partitions of the large datasets, and we have used squared Euclidean distance to measure the similarity between data points. The partitioning of datasets is done according to the number of cluster...
Improved Univariate Microaggregation for Integer Values
Privacy issues during data publishing are an increasing concern for the entities involved. The problem is addressed in the field of statistical disclosure control with the aim of producing protected datasets that are also useful for interested end users such as government agencies and research communities. The problem of producing useful protected datasets is addressed in multiple computational priva...
Publication date: 2007